We introduce anchored radial observations (ARO), a novel shape encoding for learning neural field representation of shapes that is category-agnostic and generalizable amid significant shape variations. The main idea behind our work is to reason about shapes through partial observations from a set of viewpoints, called anchors. We develop a general and unified shape representation by employing a fixed set of anchors, via Fibonacci sampling, and designing a coordinate-based deep neural network to predict the occupancy value of a query point in space. Differently from prior neural implicit models, that use global shape feature, our shape encoder operates on contextual, query-specific features. To predict point occupancy, locally observed shape information from the perspective of the anchors surrounding the input query point are encoded and aggregated through an attention module, before implicit decoding is performed. We demonstrate the quality and generality of our network, coined ARO-Net, on surface reconstruction from sparse point clouds, with tests on novel and unseen object categories, "one-shape" training, and comparisons to state-of-the-art neural and classical methods for reconstruction and tessellation.
translated by 谷歌翻译
The security of artificial intelligence (AI) is an important research area towards safe, reliable, and trustworthy AI systems. To accelerate the research on AI security, the Artificial Intelligence Security Competition (AISC) was organized by the Zhongguancun Laboratory, China Industrial Control Systems Cyber Emergency Response Team, Institute for Artificial Intelligence, Tsinghua University, and RealAI as part of the Zhongguancun International Frontier Technology Innovation Competition (https://www.zgc-aisc.com/en). The competition consists of three tracks, including Deepfake Security Competition, Autonomous Driving Security Competition, and Face Recognition Security Competition. This report will introduce the competition rules of these three tracks and the solutions of top-ranking teams in each track.
translated by 谷歌翻译
We introduce a lightweight network to improve descriptors of keypoints within the same image. The network takes the original descriptors and the geometric properties of keypoints as the input, and uses an MLP-based self-boosting stage and a Transformer-based cross-boosting stage to enhance the descriptors. The enhanced descriptors can be either real-valued or binary ones. We use the proposed network to boost both hand-crafted (ORB, SIFT) and the state-of-the-art learning-based descriptors (SuperPoint, ALIKE) and evaluate them on image matching, visual localization, and structure-from-motion tasks. The results show that our method significantly improves the performance of each task, particularly in challenging cases such as large illumination changes or repetitive patterns. Our method requires only 3.2ms on desktop GPU and 27ms on embedded GPU to process 2000 features, which is fast enough to be applied to a practical system.
translated by 谷歌翻译
We propose LiDAL, a novel active learning method for 3D LiDAR semantic segmentation by exploiting inter-frame uncertainty among LiDAR frames. Our core idea is that a well-trained model should generate robust results irrespective of viewpoints for scene scanning and thus the inconsistencies in model predictions across frames provide a very reliable measure of uncertainty for active sample selection. To implement this uncertainty measure, we introduce new inter-frame divergence and entropy formulations, which serve as the metrics for active selection. Moreover, we demonstrate additional performance gains by predicting and incorporating pseudo-labels, which are also selected using the proposed inter-frame uncertainty measure. Experimental results validate the effectiveness of LiDAL: we achieve 95% of the performance of fully supervised learning with less than 5% of annotations on the SemanticKITTI and nuScenes datasets, outperforming state-of-the-art active learning methods. Code release: https://github.com/hzykent/LiDAL.
translated by 谷歌翻译
真实世界的文本应用程序通常涉及组成广泛的文本控制操作,例如编辑文本W.R.T.属性,操纵关键字和结构,并生成所需属性的新文本。事先的工作通常会学习/芬太尼语言模型(LM)以执行操作的个人或特定子集。最近的研究以插件方式研究了合并操作,通常在复杂序列空间中以昂贵的搜索或优化进行了研究。本文提出了一种新的有效方法,用于在紧凑的文本潜在空间中进行可复合的文本操作。文本潜在矢量的低维度和不同性使我们能够基于给定的任意插入运算符(例如属性分类器)基于普通微分方程(ODE)开发有效的采样器。通过通过有效的适应性将预告片的LMS(例如GPT2)连接到潜在空间,然后我们将采样向量解码为所需的文本序列。灵活的方法允许使用来自不同域中的任何相关数据获取的各种控制操作员(情感,时态,形式,关键字等)。实验表明,在我们的方法中构成这些操作员可以生成或编辑高质量文本,从而在发电质量和效率方面显着改善了以前的方法。
translated by 谷歌翻译
近年来,由于强大的3D CNN,基于体素的方法已成为室内场景3D语义分割的最新方法。然而,基于体素的方法忽略了基础的几何形状,由于缺乏地理位置信息而在空间上闭合物体上的模棱两可的特征遭受了含糊的特征,并努力处理复杂和不规则的几何形状。鉴于此,我们提出了Voxel-Mesh网络(VMNET),这是一种新颖的3D深度体系结构,该架构在Voxel和网格表示上运行,并利用了欧几里得和地球信息。从直觉上讲,从体素中提取的欧几里得信息可以提供代表附近对象之间交互的上下文提示,而从网格中提取的地理信息可以帮助空间上接近但断开表面的分离对象。为了合并两个域中的此类信息,我们设计了一个内域的专注模块,以进行有效的特征聚集和一个用于自适应特征融合的专注于域间的模块。实验结果验证了VMNET的有效性:具体而言,在具有挑战性的扫描仪数据集上,用于大规模的室内场景分割,它的表现优于最先进的Sparseconvnet和Minkowskownet(74.6%vs 72.5%和73.6%)更简单的网络结构(17m vs 30m和38m参数)。代码发布:https://github.com/hzykent/vmnet
translated by 谷歌翻译
Temporal sentence grounding (TSG) aims to identify the temporal boundary of a specific segment from an untrimmed video by a sentence query. All existing works first utilize a sparse sampling strategy to extract a fixed number of video frames and then conduct multi-modal interactions with query sentence for reasoning. However, we argue that these methods have overlooked two indispensable issues: 1) Boundary-bias: The annotated target segment generally refers to two specific frames as corresponding start and end timestamps. The video downsampling process may lose these two frames and take the adjacent irrelevant frames as new boundaries. 2) Reasoning-bias: Such incorrect new boundary frames also lead to the reasoning bias during frame-query interaction, reducing the generalization ability of model. To alleviate above limitations, in this paper, we propose a novel Siamese Sampling and Reasoning Network (SSRN) for TSG, which introduces a siamese sampling mechanism to generate additional contextual frames to enrich and refine the new boundaries. Specifically, a reasoning strategy is developed to learn the inter-relationship among these frames and generate soft labels on boundaries for more accurate frame-query reasoning. Such mechanism is also able to supplement the absent consecutive visual semantics to the sampled sparse frames for fine-grained activity understanding. Extensive experiments demonstrate the effectiveness of SSRN on three challenging datasets.
translated by 谷歌翻译
People with diabetes are more likely to develop diabetic retinopathy (DR) than healthy people. However, DR is the leading cause of blindness. At present, the diagnosis of diabetic retinopathy mainly relies on the experienced clinician to recognize the fine features in color fundus images. This is a time-consuming task. Therefore, in this paper, to promote the development of UW-OCTA DR automatic detection, we propose a novel semi-supervised semantic segmentation method for UW-OCTA DR image grade assessment. This method, first, uses the MAE algorithm to perform semi-supervised pre-training on the UW-OCTA DR grade assessment dataset to mine the supervised information in the UW-OCTA images, thereby alleviating the need for labeled data. Secondly, to more fully mine the lesion features of each region in the UW-OCTA image, this paper constructs a cross-algorithm ensemble DR tissue segmentation algorithm by deploying three algorithms with different visual feature processing strategies. The algorithm contains three sub-algorithms, namely pre-trained MAE, ConvNeXt, and SegFormer. Based on the initials of these three sub-algorithms, the algorithm can be named MCS-DRNet. Finally, we use the MCS-DRNet algorithm as an inspector to check and revise the results of the preliminary evaluation of the DR grade evaluation algorithm. The experimental results show that the mean dice similarity coefficient of MCS-DRNet v1 and v2 are 0.5161 and 0.5544, respectively. The quadratic weighted kappa of the DR grading evaluation is 0.7559. Our code will be released soon.
translated by 谷歌翻译
Zero-Shot Learning has been a highlighted research topic in both vision and language areas. Recently, most existing methods adopt structured knowledge information to model explicit correlations among categories and use deep graph convolutional network to propagate information between different categories. However, it is difficult to add new categories to existing structured knowledge graph, and deep graph convolutional network suffers from over-smoothing problem. In this paper, we provide a new semantic enhanced knowledge graph that contains both expert knowledge and categories semantic correlation. Our semantic enhanced knowledge graph can further enhance the correlations among categories and make it easy to absorb new categories. To propagate information on the knowledge graph, we propose a novel Residual Graph Convolutional Network (ResGCN), which can effectively alleviate the problem of over-smoothing. Experiments conducted on the widely used large-scale ImageNet-21K dataset and AWA2 dataset show the effectiveness of our method, and establish a new state-of-the-art on zero-shot learning. Moreover, our results on the large-scale ImageNet-21K with various feature extraction networks show that our method has better generalization and robustness.
translated by 谷歌翻译
Due to the high activation sparsity and use of accumulates (AC) instead of expensive multiply-and-accumulates (MAC), neuromorphic spiking neural networks (SNNs) have emerged as a promising low-power alternative to traditional DNNs for several computer vision (CV) applications. However, most existing SNNs require multiple time steps for acceptable inference accuracy, hindering real-time deployment and increasing spiking activity and, consequently, energy consumption. Recent works proposed direct encoding that directly feeds the analog pixel values in the first layer of the SNN in order to significantly reduce the number of time steps. Although the overhead for the first layer MACs with direct encoding is negligible for deep SNNs and the CV processing is efficient using SNNs, the data transfer between the image sensors and the downstream processing costs significant bandwidth and may dominate the total energy. To mitigate this concern, we propose an in-sensor computing hardware-software co-design framework for SNNs targeting image recognition tasks. Our approach reduces the bandwidth between sensing and processing by 12-96x and the resulting total energy by 2.32x compared to traditional CV processing, with a 3.8% reduction in accuracy on ImageNet.
translated by 谷歌翻译